Machine Learning Analysis Pipeline
EDR: Dataset Loading & Preprocessing
EDR – Train/Test Overview
• Train shape: (88089, 20) | Test shape: (7533, 20)
• Total train samples: 88,089 | Total test samples: 7,533
• Number of features: 16
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 87,232
• 1: 857
• Class balance (minority/majority): 0.9824%
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
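The preparation steps above (infinite values handled, train-median imputation, scaler fit on train only) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code; `prepare_features` and the column list are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_features(train_df, test_df, feature_cols):
    """Impute and scale features using statistics from the train split only."""
    X_train = train_df[feature_cols].replace([np.inf, -np.inf], np.nan)
    X_test = test_df[feature_cols].replace([np.inf, -np.inf], np.nan)
    medians = X_train.median()            # train medians only: no test leakage
    X_train = X_train.fillna(medians)
    X_test = X_test.fillna(medians)       # test imputed with *train* medians
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), scaler
```

Fitting the imputer statistics and the scaler on the train split alone is what prevents information from the test set leaking into preprocessing.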
⚠️ Extreme Class Imbalance Detected
• Minority class: 857 of 88,089 training samples (≈0.97% of the data)
• This extreme imbalance may cause models to predict everything as the majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
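As one cost-sensitive option among those listed above (a sketch, not necessarily the configuration used in this pipeline), scikit-learn's "balanced" class weights up-weight the minority class in inverse proportion to its frequency:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Illustrative counts, roughly matching the ~1% minority rate reported above
y = np.array([0] * 9900 + [1] * 100)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
# "balanced" weight = n_samples / (n_classes * n_samples_in_class),
# so each class contributes equally to the loss.
clf = LogisticRegression(class_weight={0: weights[0], 1: weights[1]},
                         max_iter=1000)
```

With these counts the minority class gets weight 10000 / (2 × 100) = 50, roughly a hundred times the majority weight.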
Baseline (Most-Frequent) Accuracy: 0.9902
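The 0.9902 figure is simply the majority-class rate on the test split (7,459 of 7,533 samples). It can be reproduced with a dummy classifier:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the train/test splits reported above
y_train = np.array([0] * 87232 + [1] * 857)
y_test = np.array([0] * 7459 + [1] * 74)
X_train = np.zeros((len(y_train), 1))   # features are irrelevant to this baseline
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
acc = baseline.score(X_test, y_test)    # 7459 / 7533 ≈ 0.9902
```

Any model must beat this number before its accuracy means anything, which is why the imbalance-aware metrics above matter.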
EDR: Model Performance Comparison
EDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.9361 | 0.6400 | 0.0547 | 0.3378 | 0.0942 | 0.6447 | 0.0498 |
| Random Forest (SMOTE) | 0.8703 | 0.6067 | 0.0262 | 0.3378 | 0.0487 | 0.8074 | 0.0548 |
| LightGBM | 0.8418 | 0.6726 | 0.0310 | 0.5000 | 0.0585 | 0.8183 | 0.0980 |
| Balanced RF | 0.8824 | 0.6864 | 0.0407 | 0.4865 | 0.0752 | 0.8577 | 0.0906 |
| SGD SVM | 0.9480 | 0.5857 | 0.0457 | 0.2162 | 0.0755 | n/a | n/a |
| IsolationForest | 0.9821 | 0.5494 | 0.1039 | 0.1081 | 0.1060 | n/a | n/a |

n/a: probability scores were not available for these models, so ROC-AUC and PR-AUC were not computed.
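ROC-AUC and PR-AUC are threshold-free ranking metrics computed from continuous scores rather than hard labels. A minimal illustration on toy data (not this dataset):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([0, 0, 0, 0, 1])            # 20% positives (toy example)
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.9])  # perfect ranking of the positive
roc_auc = roc_auc_score(y_true, scores)             # 1.0 for a perfect ranking
pr_auc = average_precision_score(y_true, scores)    # 1.0 here as well
```

Under heavy imbalance PR-AUC is the more informative of the two, since its baseline equals the positive rate (~1% here) while ROC-AUC's baseline stays at 0.5.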
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 7027 | 432 | 49 | 25 | 5.79% | 66.22% |
| Random Forest (SMOTE) | 6531 | 928 | 49 | 25 | 12.44% | 66.22% |
| LightGBM | 6304 | 1155 | 37 | 37 | 15.48% | 50.00% |
| Balanced RF | 6611 | 848 | 38 | 36 | 11.37% | 51.35% |
| SGD SVM | 7125 | 334 | 58 | 16 | 4.48% | 78.38% |
| IsolationForest | 7390 | 69 | 66 | 8 | 0.93% | 89.19% |
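FP Rate and Miss Rate in this table follow the usual definitions, FP/(FP+TN) and FN/(FN+TP). A small helper reproduces the LightGBM row:

```python
def fp_and_miss_rate(tn, fp, fn, tp):
    """FP rate = share of negatives flagged; miss rate = share of positives not caught."""
    return fp / (fp + tn), fn / (fn + tp)

# LightGBM row from the table above
fp_rate, miss_rate = fp_and_miss_rate(tn=6304, fp=1155, fn=37, tp=37)
# fp_rate ≈ 0.1548 (15.48%), miss_rate = 0.5000 (50.00%)
```

The trade-off in the table is visible here: LightGBM halves the miss rate relative to Logistic Regression but at the cost of nearly triple the false-positive rate.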
Best Models by Metric
| Metric | Best Model | Value |
|---|---|---|
| Accuracy | IsolationForest | 0.9821 |
| Balanced Acc | Balanced RF | 0.6864 |
| Precision | IsolationForest | 0.1039 |
| Recall | LightGBM | 0.5000 |
| F1 | IsolationForest | 0.1060 |
| ROC-AUC | Balanced RF | 0.8577 |
| PR-AUC | LightGBM | 0.0980 |
| Lowest False Positive Rate | IsolationForest | 0.93% |
| Lowest Miss Rate | LightGBM | 50.00% |
EDR – Metrics by Model
EDR – ROC Curves
EDR – Precision–Recall Curves
EDR – Predicted Probability Distributions
EDR – Threshold Sweep
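A threshold sweep searches for the score cutoff that maximizes a chosen metric such as F1, rather than fixing it at 0.5 (which is rarely optimal under this kind of imbalance). A sketch using scikit-learn, on toy data:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return the score threshold that maximizes F1 along the PR curve."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = int(np.argmax(f1[:-1]))   # last PR point (recall=0) has no threshold
    return thresholds[best], f1[best]

thr, f1 = best_f1_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In practice the sweep should be run on a validation split, not the test set, so the chosen threshold does not overfit the reported numbers.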
EDR: Logistic Regression – Detailed Analysis
EDR – Logistic Regression: Confusion Matrix
EDR – Logistic Regression: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9931 | 0.9421 | 0.9669 | 7459 |
| 1 | 0.0547 | 0.3378 | 0.0942 | 74 |
| accuracy | – | – | 0.9361 | 7533 |
EDR – Logistic Regression: Feature Importance
EDR: Random Forest (SMOTE) – Detailed Analysis
EDR – Random Forest (SMOTE): Confusion Matrix
EDR – Random Forest (SMOTE): Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9926 | 0.8756 | 0.9304 | 7459 |
| 1 | 0.0262 | 0.3378 | 0.0487 | 74 |
| accuracy | – | – | 0.8703 | 7533 |
EDR – Random Forest (SMOTE): Feature Importance
EDR: LightGBM – Detailed Analysis
EDR – LightGBM: Confusion Matrix
EDR – LightGBM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9942 | 0.8452 | 0.9136 | 7459 |
| 1 | 0.0310 | 0.5000 | 0.0585 | 74 |
| accuracy | – | – | 0.8418 | 7533 |
EDR – LightGBM: Feature Importance
EDR: Balanced RF – Detailed Analysis
EDR – Balanced RF: Confusion Matrix
EDR – Balanced RF: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9943 | 0.8863 | 0.9372 | 7459 |
| 1 | 0.0407 | 0.4865 | 0.0752 | 74 |
| accuracy | – | – | 0.8824 | 7533 |
EDR – Balanced RF: Feature Importance
EDR: SGD SVM – Detailed Analysis
EDR – SGD SVM: Confusion Matrix
EDR – SGD SVM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9919 | 0.9552 | 0.9732 | 7459 |
| 1 | 0.0457 | 0.2162 | 0.0755 | 74 |
| accuracy | – | – | 0.9480 | 7533 |
EDR – SGD SVM: Feature Importance
EDR: IsolationForest – Detailed Analysis
EDR – IsolationForest: Confusion Matrix
EDR – IsolationForest: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9911 | 0.9907 | 0.9909 | 7459 |
| 1 | 0.1039 | 0.1081 | 0.1060 | 74 |
| accuracy | – | – | 0.9821 | 7533 |
EDR – IsolationForest: Feature Importance
Feature importance not available for this model type.
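IsolationForest differs from the other models here: it is an unsupervised anomaly detector that flags points which are easy to isolate with random splits, which is also why it exposes no supervised feature importances. A minimal sketch on synthetic data (not this dataset):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(500, 4))   # bulk of "benign" points
X_anomaly = rng.normal(8.0, 0.5, size=(5, 4))    # far-away synthetic outliers

iso = IsolationForest(random_state=0).fit(X_normal)
labels = iso.predict(X_anomaly)            # -1 = anomaly, +1 = inlier
scores = iso.decision_function(X_anomaly)  # lower = more anomalous
```

Its very low FP rate but high miss rate in the tables above is characteristic: it only flags points that look globally unusual, which many true positives in this data apparently do not.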
XDR: Dataset Loading & Preprocessing
XDR – Train/Test Overview
• Train shape: (88089, 34) | Test shape: (7533, 34)
• Total train samples: 88,089 | Total test samples: 7,533
• Number of features: 30
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 87,232
• 1: 857
• Class balance (minority/majority): 0.9824%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
⚠️ Extreme Class Imbalance Detected
• Minority class: 857 of 88,089 training samples (≈0.97% of the data)
• This extreme imbalance may cause models to predict everything as the majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
Baseline (Most-Frequent) Accuracy: 0.9902
XDR: Model Performance Comparison
XDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.9356 | 0.5996 | 0.0423 | 0.2568 | 0.0727 | 0.6560 | 0.0462 |
| Random Forest (SMOTE) | 0.9034 | 0.5900 | 0.0288 | 0.2703 | 0.0521 | 0.7988 | 0.0670 |
| LightGBM | 0.8756 | 0.6897 | 0.0395 | 0.5000 | 0.0732 | 0.8509 | 0.1250 |
| Balanced RF | 0.8937 | 0.6921 | 0.0451 | 0.4865 | 0.0825 | 0.8588 | 0.0929 |
| SGD SVM | 0.8145 | 0.5786 | 0.0182 | 0.3378 | 0.0346 | n/a | n/a |
| IsolationForest | 0.9881 | 0.5190 | 0.1364 | 0.0405 | 0.0625 | n/a | n/a |

n/a: probability scores were not available for these models, so ROC-AUC and PR-AUC were not computed.
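The missing ROC/PR-AUC entries for the SGD SVM reflect that a hinge-loss `SGDClassifier` exposes no `predict_proba`; its signed margin from `decision_function` can still be used for ranking metrics. A sketch on toy, linearly separable data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

X = np.linspace(-1.0, 1.0, 100).reshape(-1, 1)
y = (X.ravel() > 0).astype(int)          # linearly separable toy labels

svm = SGDClassifier(loss="hinge", random_state=0).fit(X, y)
margins = svm.decision_function(X)       # signed distance to the hyperplane
auc = roc_auc_score(y, margins)          # ranking-based, no probabilities needed
```

Computing AUC from the margins this way would let the SGD SVM be compared on the same footing as the probabilistic models above.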
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 7029 | 430 | 55 | 19 | 5.76% | 74.32% |
| Random Forest (SMOTE) | 6785 | 674 | 54 | 20 | 9.04% | 72.97% |
| LightGBM | 6559 | 900 | 37 | 37 | 12.07% | 50.00% |
| Balanced RF | 6696 | 763 | 38 | 36 | 10.23% | 51.35% |
| SGD SVM | 6111 | 1348 | 49 | 25 | 18.07% | 66.22% |
| IsolationForest | 7440 | 19 | 71 | 3 | 0.25% | 95.95% |
Best Models by Metric
| Metric | Best Model | Value |
|---|---|---|
| Accuracy | IsolationForest | 0.9881 |
| Balanced Acc | Balanced RF | 0.6921 |
| Precision | IsolationForest | 0.1364 |
| Recall | LightGBM | 0.5000 |
| F1 | Balanced RF | 0.0825 |
| ROC-AUC | Balanced RF | 0.8588 |
| PR-AUC | LightGBM | 0.1250 |
| Lowest False Positive Rate | IsolationForest | 0.25% |
| Lowest Miss Rate | LightGBM | 50.00% |
XDR – Metrics by Model
XDR – ROC Curves
XDR – Precision–Recall Curves
XDR – Predicted Probability Distributions
XDR – Threshold Sweep
XDR: Logistic Regression – Detailed Analysis
XDR – Logistic Regression: Confusion Matrix
XDR – Logistic Regression: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9922 | 0.9424 | 0.9667 | 7459 |
| 1 | 0.0423 | 0.2568 | 0.0727 | 74 |
| accuracy | – | – | 0.9356 | 7533 |
XDR – Logistic Regression: Feature Importance
XDR: Random Forest (SMOTE) – Detailed Analysis
XDR – Random Forest (SMOTE): Confusion Matrix
XDR – Random Forest (SMOTE): Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9921 | 0.9096 | 0.9491 | 7459 |
| 1 | 0.0288 | 0.2703 | 0.0521 | 74 |
| accuracy | – | – | 0.9034 | 7533 |
XDR – Random Forest (SMOTE): Feature Importance
XDR: LightGBM – Detailed Analysis
XDR – LightGBM: Confusion Matrix
XDR – LightGBM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9944 | 0.8793 | 0.9333 | 7459 |
| 1 | 0.0395 | 0.5000 | 0.0732 | 74 |
| accuracy | – | – | 0.8756 | 7533 |
XDR – LightGBM: Feature Importance
XDR: Balanced RF – Detailed Analysis
XDR – Balanced RF: Confusion Matrix
XDR – Balanced RF: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9944 | 0.8977 | 0.9436 | 7459 |
| 1 | 0.0451 | 0.4865 | 0.0825 | 74 |
| accuracy | – | – | 0.8937 | 7533 |
XDR – Balanced RF: Feature Importance
XDR: SGD SVM – Detailed Analysis
XDR – SGD SVM: Confusion Matrix
XDR – SGD SVM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9920 | 0.8193 | 0.8974 | 7459 |
| 1 | 0.0182 | 0.3378 | 0.0346 | 74 |
| accuracy | – | – | 0.8145 | 7533 |
XDR – SGD SVM: Feature Importance
XDR: IsolationForest – Detailed Analysis
XDR – IsolationForest: Confusion Matrix
XDR – IsolationForest: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9905 | 0.9975 | 0.9940 | 7459 |
| 1 | 0.1364 | 0.0405 | 0.0625 | 74 |
| accuracy | – | – | 0.9881 | 7533 |
XDR – IsolationForest: Feature Importance
Feature importance not available for this model type.